0.1 About the dataset

The Global Terrorism Database (GTD) is the most comprehensive unclassified database of terrorist attacks in the world. The National Consortium for the Study of Terrorism and Responses to Terrorism (START) makes the GTD available via this site in an effort to improve understanding of terrorist violence, so that it can be more readily studied and defeated. The GTD is produced by a dedicated team of researchers and technical staff.

The GTD is an open-source database that provides information on domestic and international terrorist attacks around the world since 1970, and it now includes more than 200,000 events. For each event, a wide range of information is available, including the date and location of the incident, the weapons used, the nature of the target, the number of casualties, and, when identifiable, the group or individual responsible. Dataset link: https://www.start.umd.edu/gtd/access/

1 Data Preparation

1.1 Load libraries

library(tidyverse)
library(data.table)
library(lubridate)
library(RColorBrewer)
library(gridExtra)
library(plotly)
library(ggthemes)
library(wesanderson)
library(leaflet)
library(VIM)

1.2 Load data

dt <- as_tibble(fread("globalterrorismdb_0221dist.csv",
                      na.strings = c("", "NA")))
dt
## # A tibble: 201,183 x 135
##         eventid iyear imonth  iday approxdate extended resolution country country_txt
##         <int64> <int>  <int> <int> <chr>         <int> <chr>        <int> <chr>      
##  1 197000000001  1970      7     2 <NA>              0 <NA>            58 Dominican ~
##  2 197000000002  1970      0     0 <NA>              0 <NA>           130 Mexico     
##  3 197001000001  1970      1     0 <NA>              0 <NA>           160 Philippines
##  4 197001000002  1970      1     0 <NA>              0 <NA>            78 Greece     
##  5 197001000003  1970      1     0 <NA>              0 <NA>           101 Japan      
##  6 197001010002  1970      1     1 <NA>              0 <NA>           217 United Sta~
##  7 197001020001  1970      1     2 <NA>              0 <NA>           218 Uruguay    
##  8 197001020002  1970      1     2 <NA>              0 <NA>           217 United Sta~
##  9 197001020003  1970      1     2 <NA>              0 <NA>           217 United Sta~
## 10 197001030001  1970      1     3 <NA>              0 <NA>           217 United Sta~
## # ... with 201,173 more rows, and 126 more variables: region <int>,
## #   region_txt <chr>, provstate <chr>, city <chr>, latitude <dbl>,
## #   longitude <dbl>, specificity <int>, vicinity <int>, location <chr>,
## #   summary <chr>, crit1 <int>, crit2 <int>, crit3 <int>, doubtterr <int>,
## #   alternative <int>, alternative_txt <chr>, multiple <int>, success <int>,
## #   suicide <int>, attacktype1 <int>, attacktype1_txt <chr>, attacktype2 <int>,
## #   attacktype2_txt <chr>, attacktype3 <int>, attacktype3_txt <chr>, ...

There are 135 variables in the original data. We’ll select variables that are relatively easy to interpret and have fewer missing values: year, month, location, number of kills, ransom, suicide, and so on.

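# Keep 17 columns by position (their names appear in the glimpse() output below)
gbtr <- select(dt, c(1,2,3,4,9,11,12,13,14,15,18,27,28,59,99,113,117))
# Month and day are coded 0 when unknown; recode them to NA
gbtr$imonth[gbtr$imonth==0] <- NA
gbtr$iday[gbtr$iday==0] <- NA

# Events from 2000 onward; the 0 -> NA recoding carries over from gbtr
gbtr2k <- gbtr %>% filter(iyear>=2000)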

glimpse(gbtr)
## Rows: 201,183
## Columns: 17
## $ eventid     <int64> 197000000001, 197000000002, 197001000001, 197001000002, ~
## $ iyear       <int> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970~
## $ imonth      <int> 7, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ iday        <int> 2, NA, NA, NA, NA, 1, 2, 2, 2, 3, 1, 6, 8, 9, 9, 10, 11, 1~
## $ country_txt <chr> "Dominican Republic", "Mexico", "Philippines", "Greece", "~
## $ region_txt  <chr> "Central America & Caribbean", "North America", "Southeast~
## $ provstate   <chr> "National", "Federal", "Tarlac", "Attica", "Fukouka", "Ill~
## $ city        <chr> "Santo Domingo", "Mexico city", "Unknown", "Athens", "Fuko~
## $ latitude    <dbl> 18.45679, 19.37189, 15.47860, 37.99749, 33.58041, 37.00511~
## $ longitude   <dbl> -69.95116, -99.08662, 120.59974, 23.76273, 130.39636, -89.~
## $ location    <chr> NA, NA, NA, NA, NA, NA, NA, "Edes Substation", NA, NA, NA,~
## $ success     <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ suicide     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ gname       <chr> "MANO-D", "23rd of September Communist League", "Unknown",~
## $ nkill       <int> 1, 0, 1, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 1, 0, 0~
## $ nhours      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ ransom      <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
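Selecting columns by position breaks silently if a later GTD release reorders its columns. As a hedged alternative, the sketch below selects by name, using the column names shown in the `glimpse()` output. `dt_toy` and `gbtr_toy` are hypothetical stand-ins with made-up values, not part of the original analysis.

```r
library(dplyr)

# Column names taken from the glimpse() output above
keep <- c("eventid", "iyear", "imonth", "iday", "country_txt", "region_txt",
          "provstate", "city", "latitude", "longitude", "location",
          "success", "suicide", "gname", "nkill", "nhours", "ransom")

# Hypothetical toy stand-in for `dt` (values are made up for illustration)
dt_toy <- tibble(
  eventid  = c(197000000002, 200105120001),
  iyear    = c(1970L, 2001L),
  imonth   = c(0L, 5L),
  iday     = c(0L, 12L),
  extended = c(0L, 0L)   # not in `keep`, so it is dropped
)

gbtr_toy <- dt_toy %>%
  select(any_of(keep)) %>%                          # keep only the named columns that exist
  mutate(across(c(imonth, iday), ~ na_if(.x, 0L)))  # 0 means "unknown": recode to NA

gbtr_toy
```

On the full data, `all_of(keep)` may be preferable to `any_of(keep)`, since it raises an error when an expected column is missing.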

1.3 Some sample values

head(gbtr)
## # A tibble: 6 x 17
##        eventid iyear imonth  iday country_txt region_txt provstate city  latitude
##        <int64> <int>  <int> <int> <chr>       <chr>      <chr>     <chr>    <dbl>
## 1 197000000001  1970      7     2 Dominican ~ Central A~ National  Sant~     18.5
## 2 197000000002  1970     NA    NA Mexico      North Ame~ Federal   Mexi~     19.4
## 3 197001000001  1970      1    NA Philippines Southeast~ Tarlac    Unkn~     15.5
## 4 197001000002  1970      1    NA Greece      Western E~ Attica    Athe~     38.0
## 5 197001000003  1970      1    NA Japan       East Asia  Fukouka   Fuko~     33.6
## 6 197001010002  1970      1     1 United Sta~ North Ame~ Illinois  Cairo     37.0
## # ... with 8 more variables: longitude <dbl>, location <chr>, success <int>,
## #   suicide <int>, gname <chr>, nkill <int>, nhours <dbl>, ransom <int>

1.4 Visualization of missing values

matrixplot(gbtr, sortby = c("nkill"))

aggr(gbtr, labels=names(gbtr),cex.axis = .9)

Variables such as location, nhours, and ransom have a large number of missing values. EDA with these variables will be avoided to reduce complexity.
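The missing-value plots can be complemented with a numeric summary. The sketch below computes the share of missing values per column in base R. The `toy` data frame is a hypothetical stand-in (made-up values) for `gbtr`.

```r
# Hypothetical toy data; on the real data, replace `toy` with `gbtr`
toy <- data.frame(
  nkill  = c(1, NA, 0, 2),
  nhours = c(NA, NA, NA, 5),
  ransom = c(0, NA, NA, 1)
)

# is.na() gives a logical matrix; column means are the missing shares
miss_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
round(miss_share, 2)
```

Columns at the top of `miss_share` are the natural candidates to leave out of the EDA.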

2 Analysis

2.1 Events by year

p <- gbtr %>% mutate(iyear=as.factor(iyear))  %>%
  group_by(iyear) %>% count() %>% 
  ggplot(aes(x=iyear,y=n,group=1)) +
  geom_line(size=1, color="brown")+
  geom_point(color="brown") +
  scale_x_discrete(
    breaks=c("1970", "2000","2008", "2011", "2014","2017")
    ) +
  labs(title = "Event by year", x = "year", y = "count")+
  theme_economist() 
p

There has been a rapid increase in terrorist events since 2000. We’ll observe the trend separately by region.

2.2 Overall trend in each region

p4 <- gbtr %>% count(region_txt, iyear) %>% 
  ggplot(aes(iyear, n,color=region_txt)) +
  geom_line(aes(group=region_txt)) +
  labs(title = "Trend by Region", x="year", y="count", color="region")+
  theme_light()
ggplotly(p4)

Hover over the plot to see the region labels. Middle East & North Africa and South Asia are the regions mainly responsible for the spike in the data.

2.3 Events & num. of kills by region

Since there is a steep upward trend from approximately 2000, we’ll inspect the periods before and after 2000 separately.

p2 <- gbtr %>% mutate(pd=ifelse(iyear<2000,"before 2000", "after 2000")) %>%
  mutate(pd = factor(pd, levels = c("before 2000", "after 2000")))%>% 
  group_by(region_txt, pd) %>% count() %>%
  ggplot(aes(x=reorder(region_txt, n), y=n))+
  geom_bar(aes( fill=pd), stat= "identity", position = "dodge")+
  labs(title = "Events by region", x = "region", y = "count", fill = "period")+
  theme_economist()+
  scale_fill_manual(values = c("#66b2b2","#006666")) +
  coord_flip() 

p2

  • After 2000, “Middle East & North Africa” became the region with the most terrorist attacks (before 2000 it was “South America”).

  • “South Asia” saw the largest increase in terrorism since the 70s.
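The before/after comparison behind these bullets can also be put in a table. The sketch below counts events per region and period, then computes the change. It assumes the same 2000 cutoff as the plot, and `toy` is hypothetical data standing in for `gbtr`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; on the real data, replace `toy` with `gbtr`
toy <- tibble(
  region_txt = c("A", "A", "A", "B", "B", "B", "B"),
  iyear      = c(1995, 2003, 2010, 1980, 1990, 1999, 2005)
)

change <- toy %>%
  mutate(pd = if_else(iyear < 2000, "before", "after")) %>%
  count(region_txt, pd) %>%
  # one row per region, one column per period; absent periods count as 0
  pivot_wider(names_from = pd, values_from = n, values_fill = 0L) %>%
  mutate(increase = after - before) %>%
  arrange(desc(increase))

change
```

Note that the two periods cover different numbers of years, so the raw difference is only a rough indicator.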

2.4 Number of deaths and number of events

pkr <- gbtr2k %>% filter(!is.na(nkill)) %>%  group_by(region_txt) %>% 
  summarise(ksum=sum(nkill)) %>% 
  ggplot(aes(reorder(region_txt,ksum), ksum))+
  geom_bar(stat = "identity", fill="#2E8B57")+
  coord_flip()+
  labs(title = "Num. of kills by region", subtitle = "without missing values, after 2000", x="region", y="count")+
  theme_economist()

  
per <-   gbtr2k %>% count(region_txt) %>% 
  ggplot(aes(x=reorder(region_txt, n), y=n))+
  geom_bar(stat= "identity", fill="#006666")+
  labs(title = "Events by region",subtitle = "after 2000", x = "region", y = "count")+
  theme_economist()+
  coord_flip()
  
  
grid.arrange(pkr,per,ncol=2)

  • Despite the missing values, South Asia has the largest number of kills after “Sub-Saharan Africa” and “Middle East & North Africa”.
  • North America has a higher number of kills than Western Europe and South America, even though it has fewer attacks.
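One way to make the kills-versus-attacks comparison explicit is to compute both in a single summary, together with kills per event. This is only a sketch on hypothetical `toy` data standing in for `gbtr2k`. The choice to divide by events with a recorded `nkill` is an assumption, not something the original analysis does.

```r
library(dplyr)

# Hypothetical toy data; on the real data, replace `toy` with `gbtr2k`
toy <- tibble(
  region_txt = c("X", "X", "X", "Y", "Y"),
  nkill      = c(0L, 1L, NA, 10L, 20L)
)

by_region <- toy %>%
  group_by(region_txt) %>%
  summarise(
    events            = n(),                      # all events, including unknown nkill
    events_with_nkill = sum(!is.na(nkill)),       # events with a recorded death count
    kills             = sum(nkill, na.rm = TRUE),
    kills_per_event   = kills / events_with_nkill
  )

by_region
```

A region with few attacks but a high `kills_per_event` would show the pattern described for North America.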

2.5 Events by country

We’ll look at data from 2000 onward.

pec <- gbtr2k %>% group_by(country_txt) %>% count() %>% ungroup() %>%
  top_n(n=20,wt = n) %>% 
  ggplot(aes(reorder(country_txt, n), n))+
  geom_bar(stat = "identity", fill="#21618C") +
  labs(title = "Event by country", subtitle = "after 2000", x = "Country", y = "Count") +
  theme_economist() +
  scale_fill_manual(values = wes_palette(n=4,"Cavalcanti1"))+
  coord_flip()
pec

2.6 Suicide attack

dtscd <- gbtr2k %>% filter(!is.na(suicide)) %>%  group_by(region_txt, suicide) %>% count()  %>%
  ungroup() %>% group_by(region_txt) %>% mutate(pct=n/sum(n)) %>% filter(suicide==1)

ggplot(dtscd, aes(reorder(region_txt, pct), pct*100)) +
  geom_bar(stat = "identity", fill="#5D6D7E")+
  coord_flip()+
  labs(title = "Pct of suicide attack by region", subtitle = "after 2000", x="region",y="%")+
  theme_economist()
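The same percentage can be computed more directly as the mean of a logical condition. The sketch below uses hypothetical `toy` data in place of `gbtr2k`. Unlike the count-then-filter approach, it keeps regions with no suicide attacks at all (as 0%) rather than dropping them.

```r
library(dplyr)

# Hypothetical toy data; on the real data, replace `toy` with `gbtr2k`
toy <- tibble(
  region_txt = c("X", "X", "X", "X", "Y", "Y"),
  suicide    = c(1L, 0L, 0L, NA, 0L, 0L)
)

pct_suicide <- toy %>%
  filter(!is.na(suicide)) %>%
  group_by(region_txt) %>%
  summarise(pct = 100 * mean(suicide == 1))  # share of events flagged as suicide attacks

pct_suicide
```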

2.7 Groups, attacks, and suicide

gbtr %>%filter(gname!="Unknown") %>%  group_by(gname,suicide) %>% summarise(n=n()) %>%
  ungroup() %>% group_by(gname) %>% mutate(sum=sum(n)) %>% ungroup() %>%  top_n(30,sum) %>% 
  ggplot(aes(x=reorder(gname,sum),n, fill=factor(suicide, levels = c(1,0)))) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Groups and attacks", x="groups", y="attacks", fill="suicide") +
  theme_economist_white() +
  scale_fill_manual(values = wes_palette(n=2, "Cavalcanti1"))
## `summarise()` has grouped output by 'gname'. You can override using the `.groups` argument.

Disregarding the “Unknown” groups:

  • The Taliban, ISIL, and SL (Shining Path) are responsible for the most attacks.
  • ISIL carried out the most suicide attacks (23%).

2.8 Attack type by region

wp <- dt %>% select(1,2,3,4,9,11,13,14,15,27,28,30,59,83,85,99,102,117)
wp$imonth[wp$imonth==0] <- NA
wp$iday[wp$iday==0] <- NA
patkrg<- wp %>% group_by(region_txt, attacktype1_txt) %>% count() %>% 
  ggplot(aes(region_txt, n, fill=attacktype1_txt)) +
  geom_bar(stat = "identity",position = "stack")+
  scale_fill_manual(values = wes_palette("Darjeeling1" ,n=9, type="continuous"))+
  theme_economist()+
  scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 0.8))+
  labs(title = "Attack type by region",
       x="region", y="num.", fill="attack type")

patkrg2<- wp %>% group_by(region_txt, attacktype1_txt) %>% count() %>% 
  ggplot(aes(region_txt, n, fill=attacktype1_txt)) +
  geom_bar(stat = "identity",position = "fill")+
  scale_fill_manual(values = wes_palette("Darjeeling1" ,n=9, type="continuous"))+
  theme_economist()+
  scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 0.8))+
  labs(title = "Attack type by region (share)",
       x="region", y="proportion", fill="attack type")

patkrg

patkrg2

  • Western Europe has more “Facility/Infrastructure Attack” (in number) than any other region.
  • Bombing/Explosion is the most common attack type in Middle East & North Africa.

2.9 Attack type by group

Different groups might prefer different attack methods. There are 3671 groups in the data. We’ll look at the groups with the most attacks.

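# `grp` is not defined in the code shown in this document. As an assumption, it
# is taken here to be the named groups with the most attacks (the cutoff of 10
# is a guess, so the output may differ from the original).
grp <- wp %>% filter(gname != "Unknown") %>% count(gname, sort = TRUE) %>% head(10)

wp %>% filter(gname %in% grp$gname)%>% 
  group_by(gname, attacktype1_txt) %>% count() %>% 
  ggplot(aes(gname, n, fill= attacktype1_txt))+
  geom_bar(stat = "identity",position = "stack")+
  scale_fill_manual(values = wes_palette("Darjeeling1",n=9, type="continuous"))+
  theme_economist()+
  scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 5))+
  labs(title = "Attack type by groups", subtitle = "Groups with the most attacks",
       x="groups", y="num.", fill="attack type")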

  • Armed assault is common in most groups, except the IRA, which prefers assassination as its second choice after bombing.
  • Bombing is the attack type ISIL uses most.
  • About 32% of ISIL’s bombing attacks are suicide attacks.

wp %>% filter(attacktype1_txt=="Bombing/Explosion" & gname %in% grp$gname ) %>% 
  group_by(gname, suicide) %>% count() %>% ungroup() %>%
  group_by(gname) %>% mutate(pct=n/sum(n)) %>%
  filter(suicide==1) %>% arrange(desc(pct))
## # A tibble: 6 x 4
## # Groups:   gname [6]
##   gname                                       suicide     n     pct
##   <chr>                                         <int> <int>   <dbl>
## 1 Boko Haram                                        1   510 0.527  
## 2 Islamic State of Iraq and the Levant (ISIL)       1  1369 0.317  
## 3 Taliban                                           1   727 0.216  
## 4 Al-Shabaab                                        1   192 0.125  
## 5 Kurdistan Workers' Party (PKK)                    1    27 0.0310 
## 6 Houthi extremists (Ansar Allah)                   1     2 0.00142

2.10 Number of deaths by attack type and region

wp %>%  filter(!is.na(nkill)&attacktype1_txt!="Unknown") %>%
  group_by(region_txt,attacktype1_txt) %>% 
  summarise(sumk=sum(nkill), event=n(), kperattack=sum(nkill)/n()) %>% 
    ggplot(aes(reorder(attacktype1_txt, kperattack), kperattack))+
  geom_bar(aes(fill=region_txt), stat = "identity")+
  coord_flip()+
  facet_wrap(.~ region_txt, ncol = 4, scales = "free_x")+
  labs(title = "Deaths per event by attack type and region", x="attack type", y="death per event")+
  scale_fill_manual(values = wes_palette("Darjeeling1", n=12, type = "continuous"))+
  theme(legend.position = "none")
## `summarise()` has grouped output by 'region_txt'. You can override using the `.groups` argument.

  • The attack types that cause the most deaths per attack differ drastically from region to region.

  • Bombing (to my surprise) isn’t responsible for the most deaths per attack. Instead, it’s armed assault and hostage taking in most regions.

  • Hostage taking has the most deaths per attack in East Asia, Eastern Europe, Middle East & North Africa, South Asia, Southeast Asia, Sub-Saharan Africa and Western Europe.

  • North America’s extreme values reflect the 9/11 attacks in 2001, with nearly 3,000 recorded deaths in 4 attacks.
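Because a mean of deaths per event can be dominated by a handful of extreme events, it may help to report the median alongside it. The sketch below illustrates this on hypothetical `toy` data (made-up numbers, not GTD values). On the real data, `toy` would be replaced by `wp` filtered to non-missing `nkill`.

```r
library(dplyr)

# Hypothetical toy data: one extreme event among otherwise non-lethal hijackings
toy <- tibble(
  attacktype1_txt = c(rep("Hijacking", 5), rep("Bombing/Explosion", 5)),
  nkill           = c(0, 0, 0, 0, 700, 1, 2, 0, 3, 4)
)

stats <- toy %>%
  group_by(attacktype1_txt) %>%
  summarise(
    events      = n(),
    mean_kill   = mean(nkill),    # sensitive to a single extreme event
    median_kill = median(nkill)   # robust to outliers
  )

stats
```

A large gap between `mean_kill` and `median_kill` signals that a few events drive the average, as with the North America panel.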